
    Analysis of Speaker Clustering Strategies for HMM-Based Speech Synthesis

    This paper describes a method for speaker clustering, with the application of building average voice models for speaker-adaptive HMM-based speech synthesis that are a good basis for adapting to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model will be better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to automatically predict these listener judgements of speaker similarity and thus to identify similar speakers automatically. We then compare a variety of average voice models trained on either speakers who were perceptually judged to be similar to the target speaker, speakers selected by the multiple linear regression, or a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers provides better performance than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance. Index Terms: statistical parametric speech synthesis, hidden Markov models, speaker adaptation
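    The similarity-prediction step described above can be sketched as an ordinary least squares fit. Everything below is invented for illustration — the acoustic difference features, listener scores, and top-10 cutoff are assumptions, not the paper's actual features or data:

    ```python
    import numpy as np

    # Hypothetical data: acoustic difference features (e.g., F0 and
    # speaking-rate deltas) between a target speaker and 30 candidate
    # speakers, plus simulated listener similarity scores.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 4))                     # 4 invented features
    true_w = np.array([1.5, -0.8, 0.3, 0.0])
    y = X @ true_w + rng.normal(scale=0.1, size=30)  # simulated judgements

    # Multiple linear regression via ordinary least squares (with intercept).
    A = np.hstack([X, np.ones((30, 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ w

    # Rank candidates by predicted similarity; keep the 10 most similar
    # as the training set for the average voice model.
    top10 = np.argsort(pred)[::-1][:10]
    print(top10)
    ```

    The choice of 10 speakers is arbitrary here; the point is only that once listener judgements are regressed onto acoustic features, new speakers can be ranked without running a listening test.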

    Statistical parametric speech synthesis using conversational data and phenomena

    Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a “neutral” voice. This thesis presents work on utilising recordings of free conversation for the purpose of filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech synthesis purposes. A corpus of both standard prompts and free conversation is presented, and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, through psycholinguistic experimentation it is shown that filled pauses can have potential subconscious benefits to the listener, but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain a more accurate automatic speech segmentation, something which is particularly poor for spontaneously produced speech. This pronunciation variant alignment is utilised not only to create a more accurate underlying acoustic model, but also as the driving force behind more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices, the naturalness of voices based on spontaneous speech still lags behind the quality of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated in relation to specific phonetic modelling of filled pauses, and through techniques for mixing standard prompts with spontaneous utterances in order to retain the higher quality of standard-speech-based voices while still utilising the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed and presented, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage. The various approaches are evaluated and their improvements documented throughout the thesis; at the end, the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of the combination of all developed methods.
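    The filled-pause-insertion idea above — deciding, at each word boundary, whether a filled pause is likely given a language model — can be sketched minimally. This is not the thesis system: the bigram table and threshold below are toy values invented for illustration:

    ```python
    # Toy bigram probabilities P("um" | previous word) and an invented
    # insertion threshold; a real system would estimate these from
    # conversational speech data.
    BIGRAM = {
        ("i", "um"): 0.15, ("think", "um"): 0.02,
        ("well", "um"): 0.20, ("<s>", "um"): 0.10,
    }
    THRESHOLD = 0.12

    def insert_filled_pauses(words, bigram=BIGRAM, threshold=THRESHOLD):
        out = []
        prev = "<s>"  # sentence-start symbol
        for w in words:
            # Insert "um" when the model deems it likely after `prev`.
            if bigram.get((prev, "um"), 0.0) >= threshold:
                out.append("um")
            out.append(w)
            prev = w
        return out

    print(insert_filled_pauses(["well", "i", "think", "so"]))
    # → ['well', 'um', 'i', 'um', 'think', 'so']
    ```

    In practice the decision would combine richer context (the thesis mentions a mix of language modelling methods), but the threshold-at-each-boundary structure is the core of the idea.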

    The Temporal Delay Hypothesis: Natural, Vocoded and Synthetic Speech

    Including disfluencies in synthetic speech is being explored as a way of making synthetic speech sound more natural and conversational. How to measure whether the resulting speech is actually more natural, however, is not straightforward. Conventional approaches to synthetic speech evaluation fall short: a listener is either primed to prefer stimuli with filled pauses or, when not primed, prefers more fluent speech. Psycholinguistic reaction time experiments may circumvent this issue. In this paper, we revisit one such reaction time experiment. For natural speech, delays in word onset were found to facilitate word recognition regardless of the type of delay, be it a filled pause (um), silence or a tone. We expand these experiments by examining the effect of using vocoded and synthetic speech. Our results partially replicate previous findings. For natural and vocoded speech, if the delay is a silent pause, significant increases in the speed of word recognition are found. If the delay comprises a filled pause, there is a significant increase in reaction time for vocoded speech but not for natural speech. For synthetic speech, no clear effects of delay on word recognition are found. We hypothesise this is because it takes longer (requires more cognitive resources) to process synthetic speech than natural or vocoded speech.

    Rating Naturalness in Speech Synthesis: The Effect of Style and Expectation

    In this paper we present evidence that speech produced spontaneously in a conversation is considered more natural than read prompts. We also explore the relationship between participants’ expectations of the speech style under evaluation and their actual ratings. In successive listening tests, subjects rated the naturalness of either spontaneously produced, read aloud or written sentences, with instructions toward either conversational, reading or general naturalness. It was found that, when presented with spontaneous or read aloud speech, participants consistently rated spontaneous speech more natural, even when asked to rate naturalness in the reading case. Presented with only text, participants generally preferred transcriptions of spontaneous utterances, except when asked to evaluate naturalness in terms of reading aloud. This has implications for the application of MOS-scale naturalness ratings in speech synthesis, and potentially for the type of data suitable for use in general TTS, dialogue systems and specifically conversational TTS, in which the goal is to reproduce speech as it is produced in a spontaneous conversational setting.

    Disfluencies in change detection in natural, vocoded and synthetic speech

    In this paper, we investigate the effect of filled pauses, a discourse marker and silent pauses in a change detection experiment in natural, vocoded and synthetic speech. In natural speech, change detection has been found to increase in the presence of filled pauses; we extend this work by replicating earlier findings and exploring the effect of a discourse marker, like, and silent pauses. Furthermore, we report how the use of "unnatural" speech, namely synthetic and vocoded, affects change detection rates. It was found that the filled pauses, the discourse marker and silent pauses all increase change detection rates in natural speech; however, in neither synthetic nor vocoded speech did this effect appear. Rather, change detection rates decreased in both types of "unnatural" speech compared to natural speech. The natural-speech results suggest that while each type of pause increases detection rates, the type of pause may have a further effect. The "unnatural" results suggest that it is not the full pipeline of synthetic speech that causes the degradation, but rather that something in the pre-processing, i.e. vocoding, of the speech database limits the resulting synthesis.

    Artificial Personality and Disfluency

    The focus of this paper is artificial voices with different personalities. Previous studies have shown links between an individual’s use of disfluencies in their speech and their perceived personality. Here, filled pauses (uh and um) and discourse markers (like, you know, I mean) have been included in synthetic speech as a way of creating an artificial voice with different personalities. We discuss the automatic insertion of filled pauses and discourse markers (i.e., fillers) into otherwise fluent texts. The automatic system is compared to a ground truth of human “acted” filler insertion. Perceived personality (as defined by the big five personality dimensions) of the synthetic speech is assessed by means of a standardised questionnaire. Synthesis without fillers is compared to synthesis with either spontaneous or synthetic fillers. Our findings explore how the inclusion of disfluencies influences the way in which subjects rate the perceived personality of an artificial voice. Index Terms: artificial personality, TTS, disfluency

    A Lattice-based Approach to Automatic Filled Pause Insertion

    This paper describes a novel method for automatically inserting filled pauses (e.g., UM) into fluent texts. Although filled pauses are known to serve a wide range of psychological and structural functions in conversational speech, they have not traditionally been modelled overtly by state-of-the-art speech synthesis systems. However, several recent systems have started to model disfluencies specifically, and so there is an increasing need to create disfluent speech synthesis input by automatically inserting filled pauses into otherwise fluent text. The approach presented here interpolates Ngrams and Full-Output Recurrent Neural Network Language Models (f-RNNLMs) in a lattice-rescoring framework. It is shown that the interpolated system outperforms separate Ngram and f-RNNLM systems, where performance is analysed using the precision, recall, and F-score metrics.
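    The interpolation idea can be sketched in miniature. A full lattice is reduced here to a small n-best list of hypotheses, and both probability tables and the interpolation weight are invented for illustration — this is not the paper's actual system or data:

    ```python
    import math

    # Toy sentence probabilities under an invented "n-gram" model and an
    # invented "RNNLM"; a real system would score paths through a lattice.
    NGRAM = {"i um think": 0.04, "i think um": 0.01, "i think": 0.30}
    RNNLM = {"i um think": 0.02, "i think um": 0.06, "i think": 0.25}
    LAM = 0.5  # interpolation weight between the two models (assumed)

    def interpolated_logprob(hyp, lam=LAM):
        # Linear interpolation of the two model probabilities, with a
        # small floor so unseen hypotheses do not get log(0).
        p = lam * NGRAM.get(hyp, 1e-9) + (1 - lam) * RNNLM.get(hyp, 1e-9)
        return math.log(p)

    # Rescore the candidate insertions and keep the best-scoring one.
    hyps = ["i um think", "i think um", "i think"]
    best = max(hyps, key=interpolated_logprob)
    print(best)  # → "i think" under these toy numbers
    ```

    The benefit reported in the paper comes from the two models making complementary errors; in the sketch this shows up as the interpolated score reordering hypotheses that either model alone would rank differently.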